Abstract


To prevent unintentional data leakage, the research community has
resorted to data generators that can produce differentially private
data for model training. However, for the sake of data privacy,
existing solutions suffer from either expensive training cost or poor
generalization performance. Therefore, we raise the question of whether
training efficiency and privacy can be achieved simultaneously. In this
work, we identify for the first time that dataset condensation (DC),
which was originally designed to improve training efficiency, is also a
better alternative to traditional data generators for private data
generation, thus providing privacy for free. To demonstrate the privacy
benefit of DC, we build a connection between DC and differential
privacy, and theoretically prove on linear feature extractors (with an
extension to non-linear feature extractors) that the existence of one
sample has limited impact (O(m/n)) on the parameter distribution of
networks trained on m samples synthesized from n (n ≫ m) raw samples by
DC. We also empirically validate the visual privacy and membership
privacy of DC-synthesized data by launching both the loss-based and the
state-of-the-art likelihood-based membership inference attacks. We
envision this work as a milestone for data-efficient and
privacy-preserving machine learning.


1. Introduction 


Machine learning models are known to suffer from a wide range of
privacy attacks (Lyu et al., 2020), such as the model inversion attack
(Fredrikson et al., 2015), the membership inference attack (MIA)
(Shokri et al., 2017), and the property inference attack (Melis et al.,
2019). The numerous concerns about data privacy make it impractical for
data curators to directly distribute their private data for purposes of
interest. Previously, generative models, e.g., generative adversarial
networks (GANs) (Goodfellow et al., 2014), were supposed to be an
alternative to data sharing. Unfortunately, the aforementioned privacy
risks exist not only in training with raw data but also in training
with synthetic data produced by generative models (Chen et al., 2020b).
For example, it is easy to match the fake facial images synthesized by
GANs with the real training samples from the same identity (Webster et
al., 2021). To counter this issue, existing efforts (Xie et al., 2018;
Wang et al., 2021; Cao et al., 2021; Harder et al., 2021) applied
differential privacy (DP) (Dwork et al., 2006) to develop
differentially private data generators (called DP-generators), because
DP is the de facto privacy standard and provides theoretical guarantees
on privacy leakage. Data produced by DP-generators can then be applied
to various downstream tasks, e.g., data analysis, visualization, and
training privacy-preserving classifiers.


Figure 1. DC-synthesized data can be used for privacy-preserving
model training and cannot be recovered through MIA and visual
comparison analysis.


However, due to the noise introduced by DP, the data produced by
DP-generators are of low quality, which impedes their utility as
training data, i.e., the accuracy of the models trained on these data.
Thus, more data generated by DP-generators are needed to obtain good
generalization performance, which inevitably decreases the training
efficiency.


Recently, research on dataset condensation (DC) (Wang et al., 2018;
Sucholutsky & Schonlau, 2019; Such et al., 2020; Bohdal et al., 2020;
Zhao et al., 2021; Zhao & Bilen, 2021b;a; Nguyen et al., 2021a;b; Jin
et al., 2022; Cazenavette et al., 2022; Wang et al., 2022) has emerged,
which aims to condense a large training set into a small synthetic set
that is comparable to the original one in terms of training deep neural
networks (DNNs). Different from traditional generative models that are
trained to generate real-looking samples with high fidelity, these DC
methods generate informative training samples for data-efficient
learning. In this work, we investigate for the first time the
feasibility of protecting data privacy using DC techniques. We find
that DC can not only accelerate model training but also offer privacy
for free. Figure 1 illustrates how DC methods can be applied to protect
membership privacy and visual privacy. Specifically, we first analyse
the relationship between DC-synthesized data and the original data
(Propositions 4.3 and 4.4), and theoretically prove on linear feature
extractors that the change caused by removing or adding one element in
n raw samples to the parameter distribution of models trained on
m (m < n) DC-synthesized samples (i.e., the privacy loss) is bounded by
O(m/n) (Proposition 4.10). This means that one element does not greatly
change the model parameter distribution, which is the concept behind
DP. The conclusions are further analytically and empirically
generalized to non-linear feature extractors. Then, we empirically
validate that models trained on DC-synthesized data are robust to both
the vanilla loss-based MIA and the state-of-the-art likelihood-based
MIA (Carlini et al., 2022). Finally, we study the visual privacy of
DC-synthesized data under an adversary's direct matching attack. All
the results show that DC-synthesized data are not perceptually similar
to the original data, as our Proposition 4.4 indicates, and cannot be
traced back to the original data through similarity metrics (e.g.,
LPIPS).


Through empirical evaluations on image datasets, we vali- 
date that DC-synthesized data can preserve both data effi- 
ciency and membership privacy when being used for model 
training. For example, on FashionMNIST, DC-synthesized data enable
models to achieve a test accuracy at least 33.4% higher than that
achieved by DP-generators under the same empirical privacy budget.
Meanwhile, to achieve a test accuracy of the same level, DC only needs
to synthesize at most 50% of the data required by GAN-based methods,
which speeds up the training by at least 2 times.


In summary, our contributions are three-fold: 


• To the best of our knowledge, we are the first to introduce the
emerging dataset condensation techniques to the privacy community and
to provide a systematic audit of state-of-the-art DC methods.


• We build the connection between dataset condensation and
differential privacy, and contribute theoretical analysis with both
linear and non-linear feature extractors.


• Extensive experiments on image datasets empirically validate that DC
methods reduce the adversary advantage of membership privacy to zero,
and that DC-synthesized data are perceptually irreversible to the
original data in terms of the similarity metrics L2 and LPIPS.


2. Background and Related Work 


In this section, we briefly present dataset condensation and 
the membership privacy issues in machine learning models. 


2.1. Dataset Condensation 


Orthogonal to model knowledge distillation (Hinton et al., 2015), Wang
et al. first proposed dataset distillation (DD), which aims to distill
knowledge from a large training set into a small synthetic set. The
synthetic set can be used to efficiently train deep neural networks
with a moderate decrease of testing accuracy. Recent works
significantly advanced this research area by proposing Dataset
Condensation (DC) with gradient matching (Zhao et al., 2021; Zhao &
Bilen, 2021b), Distribution Matching (DM) (Zhao & Bilen, 2021a) and
Kernel Inducing Points (KIP) (Nguyen et al., 2021a;b). For example, the
synthetic sets (50 images per class) generated by DM can be used to
train a 3-layer convolutional neural network from scratch and obtain
over 60% testing accuracy on CIFAR10 (Krizhevsky et al., 2009) and over
98% testing accuracy on MNIST (LeCun et al., 1998). In this work, we
mainly focus on synthetic sets generated by DSA (Zhao & Bilen, 2021b),
DM (Zhao & Bilen, 2021a) and KIP (Nguyen et al., 2021a), because 1) DSA
and DM are improved versions of DC and KIP is an improved version of
DD, and 2) the performance of DD and DC is significantly lower than
that of DSA, DM and KIP.


We formulate the dataset condensation problem using the symbols
presented in (Zhao & Bilen, 2021a). Given a large-scale dataset (target
dataset) $\mathcal{T} = \{(x_i, y_i)\}$ which consists of
$|\mathcal{T}|$ samples from C classes, the objective of dataset
condensation (or distillation) is to learn a synthetic set
$\mathcal{S} = \{(s_i, y_i)\}$ with $|\mathcal{S}|$ synthetic samples
so that deep neural networks can be trained on S and achieve testing
performance comparable to those trained on T:


$$\mathbb{E}_{x \sim P_D}\left[\mathcal{L}(\phi_{\theta^{\mathcal{T}}}(x), y)\right] \simeq \mathbb{E}_{x \sim P_D}\left[\mathcal{L}(\phi_{\theta^{\mathcal{S}}}(x), y)\right], \tag{1}$$


where $P_D$ is the real data distribution,
$\phi_{\theta^{\mathcal{T}}}(\cdot)$ and
$\phi_{\theta^{\mathcal{S}}}(\cdot)$ are models trained on T and S
respectively, and $\mathcal{L}(\cdot, \cdot)$ is the loss function,
e.g., cross-entropy loss.


To achieve this goal, Wang et al. proposed a meta-learning based
method which parameterizes the model updated on the synthetic set as
$\theta^{\mathcal{S}}(\mathcal{S})$ and then learns the synthetic data
by minimizing the validation loss on the original training data T:


$$\arg\min_{\mathcal{S}} \mathcal{L}^{\mathcal{T}}(\theta^{\mathcal{S}}(\mathcal{S})), \tag{2}$$


where $\theta^{\mathcal{S}}(\mathcal{S}) = \arg\min_{\theta} \mathcal{L}^{\mathcal{S}}(\theta)$.
The meta-learning algorithm has to recurrently unroll the computation
graph of $\theta^{\mathcal{S}}$ with respect to S, which is expensive
and does not scale.
Nguyen et al. (2021a) proposed Kernel Inducing Points (KIP), which
leverages the neural tangent kernel (NTK) (Jacot et al., 2018) to
replace the expensive network parameter updating. With NTK,
$\theta^{\mathcal{S}}$ has a closed-form solution. Thus, KIP learns
synthetic data by minimizing the kernel ridge regression loss:


$$\arg\min_{\mathcal{S}} \frac{1}{2} \left\| y_t - K_{X_t X_s} (K_{X_s X_s} + \lambda I)^{-1} y_s \right\|^2, \tag{3}$$


where $X_s$ and $X_t$ are the synthetic and real images from S and T,
and $y_s$ and $y_t$ are the corresponding labels. $K_{UV}$ represents
the NTK matrix $(K(u, v))_{(u, v) \in U \times V}$ for two sets U and
V. For a neural network $\phi_\theta$, the definition of $K(u, v)$ on
elements u and v is
$K(u, v) = \nabla_\theta \phi_\theta(u) \cdot \nabla_\theta \phi_\theta(v)$.
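

The kernel ridge regression objective in (3) can be sketched in a few
lines of NumPy. This is only an illustration under stated assumptions:
an RBF kernel stands in for the NTK, and the names `rbf_kernel`,
`krr_loss` and `lam` are hypothetical and not taken from the KIP
implementation.

```python
import numpy as np

def rbf_kernel(U, V, gamma=0.5):
    """Stand-in kernel (assumption): KIP itself uses the NTK of a network."""
    sq = (U ** 2).sum(1)[:, None] + (V ** 2).sum(1)[None, :] - 2.0 * U @ V.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_loss(X_s, y_s, X_t, y_t, lam=1e-6):
    """0.5 * || y_t - K_ts (K_ss + lam I)^{-1} y_s ||^2, following Eq. (3)."""
    K_ts = rbf_kernel(X_t, X_s)
    K_ss = rbf_kernel(X_s, X_s)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_s)), y_s)
    residual = y_t - K_ts @ alpha
    return 0.5 * float((residual ** 2).sum())

# Toy data (made up): 40 "real" points with one-hot labels, 4 synthetic points.
rng = np.random.default_rng(0)
X_t = rng.normal(size=(40, 5))
y_t = np.eye(2)[rng.integers(0, 2, size=40)]
X_s, y_s = X_t[:4].copy(), y_t[:4].copy()
loss_value = krr_loss(X_s, y_s, X_t, y_t)
```

In a condensation loop, `X_s` (and optionally `y_s`) would be updated by
gradient descent on this loss; the sketch only evaluates it once.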


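Zhao et al. proposed a novel DC framework to condense the real dataset
into a small synthetic set by matching the gradients obtained when
feeding real and synthetic batches into the same model, which can be
expressed as follows:


$$\arg\min_{\mathcal{S}} \mathbb{E}_{\theta_0 \sim P_{\theta_0}} \left[ \sum_{t=0}^{T-1} D\left(\nabla_\theta \mathcal{L}^{\mathcal{S}}(\theta_t), \nabla_\theta \mathcal{L}^{\mathcal{T}}(\theta_t)\right) \right], \tag{4}$$


where the model $\theta_t$ is updated by minimizing the loss
$\mathcal{L}^{\mathcal{S}}(\theta_t)$ alternately, and D computes the
distance between gradients.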
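Zhao & Bilen (2021b) enabled the learned synthetic images to be
effectively used to train neural networks with data augmentation by
introducing the differentiable Siamese augmentation (DSA)
$\mathcal{A}_\omega(\cdot)$ and improved the matching loss in (4) as
follows:


$$D\left(\nabla_\theta \mathcal{L}(\theta_t, \mathcal{A}_\omega(\mathcal{S})), \nabla_\theta \mathcal{L}(\theta_t, \mathcal{A}_\omega(\mathcal{T}))\right). \tag{5}$$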
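

A single term of the sum in (4) can be illustrated with a linear model,
for which the gradients are available in closed form. This is a minimal
sketch under assumptions: the text only says that D "computes distance
between gradients", so the cosine distance used here is one possible
choice, and the helper names are hypothetical. The inner update of the
model and the augmentation in (5) are left out.

```python
import numpy as np

def grad_mse(theta, X, y):
    """Gradient of the mean squared error of the linear model x -> x @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(X)

def cosine_distance(g1, g2, eps=1e-12):
    """One possible choice of D (assumption): 1 - cosine similarity."""
    return 1.0 - float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps)

def matching_loss(theta, X_s, y_s, X_t, y_t):
    """Distance between the gradient on the synthetic batch and on the real batch."""
    return cosine_distance(grad_mse(theta, X_s, y_s), grad_mse(theta, X_t, y_t))

# Toy regression data (made up).
rng = np.random.default_rng(0)
X_t = rng.normal(size=(100, 6))
y_t = X_t @ rng.normal(size=6)
theta = np.zeros(6)

loss_subset = matching_loss(theta, X_t[:5], y_t[:5], X_t, y_t)  # small synthetic set
loss_same = matching_loss(theta, X_t, y_t, X_t, y_t)            # identical sets
```

When the synthetic set equals the real set the two gradients coincide
and the distance vanishes, which is the fixed point that gradient
matching drives the synthetic data towards.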


Although Zhao et al. (2021) successfully avoided unrolling the
recurrent computation graph in (Wang et al., 2018), their method still
needs to solve an expensive bi-level optimization and compute
second-order derivatives. To further simplify the learning of synthetic
data, Zhao & Bilen (2021a) proposed a simple yet effective dataset
condensation method with distribution matching (DM). Specifically, the
learned synthetic data S should have a data distribution close to that
of the real data T in randomly sampled embedding spaces:


$$\min_{\mathcal{S}} \mathbb{E}_{\theta \sim P_\theta,\, \omega \sim \Omega} \left\| \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \psi_\theta(\mathcal{A}(x_i, \omega)) - \frac{1}{|\mathcal{S}|} \sum_{j=1}^{|\mathcal{S}|} \psi_\theta(\mathcal{A}(s_j, \omega)) \right\|^2, \tag{6}$$


where $\psi_\theta$ represents randomly sampled embedding functions
(namely feature extractors), e.g., randomly initialized neural
networks, and $\mathcal{A}(\cdot, \omega)$ is the differentiable
Siamese augmentation. Experimental results show that this simple
objective can lead to effective synthetic data that are comparable to
or even better than those generated by existing methods.
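

The DM objective in (6) can be estimated by Monte Carlo sampling of
extractors. The sketch below is a simplified illustration, not the
authors' implementation: it uses random linear extractors, omits the
augmentation, and `dm_loss` with its parameters is a hypothetical name.

```python
import numpy as np

def dm_loss(S, T, n_extractors=64, k=16, seed=0):
    """Monte Carlo estimate of Eq. (6) with random linear extractors, no augmentation."""
    rng = np.random.default_rng(seed)
    d = T.shape[1]
    total = 0.0
    for _ in range(n_extractors):
        theta = rng.normal(size=(k, d)) / np.sqrt(d)  # random embedding x -> theta @ x
        gap = (T @ theta.T).mean(axis=0) - (S @ theta.T).mean(axis=0)
        total += float(gap @ gap)
    return total / n_extractors

# Toy data (made up): 500 real points, 10 synthetic points drawn from noise.
rng = np.random.default_rng(1)
T = rng.normal(loc=0.5, size=(500, 32))
S_random = rng.normal(size=(10, 32))
# Shifting the noise so that its mean equals the mean of T removes the gap.
S_centered = S_random - S_random.mean(axis=0) + T.mean(axis=0)

loss_random = dm_loss(S_random, T)
loss_centered = dm_loss(S_centered, T)
```

With linear extractors the loss depends on the data only through the
difference of the two means, which is why `S_centered` attains a loss
of numerically zero even though its points are noise.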


2.2. Membership Privacy 


For privacy analysis, we mainly focus on membership pri- 
vacy as it directly relates to personal privacy. For example, 
inferring that an individual’s facial image was in a shop’s 
training dataset reveals the individual had visited the shop. 
Shokri et al. have shown that DNNs' output can leak the membership
privacy of the input (i.e., whether the input belongs to the training
dataset) under membership inference attack (MIA). In general, MIA only
needs black-box access to the model (Sablayrolles et al., 2019) and can
be successful with logits (Yu et al., 2021) or hard-label predictions
(Li & Zhang, 2021; Choquette-Choo et al., 2021).


Loss-based MIA. The loss-based MIA infers membership by the predicted
loss: if the loss is lower than a threshold $\tau$, then the input is a
member of the training data. Formally, the membership M(x) of an input
x can be expressed as:


$$M(x) = \mathbb{1}(\ell(x) < \tau), \tag{7}$$


where M(x) = 1 means x is a training member, and $\mathbb{1}(A) = 1$ if
event A is true. The threshold $\tau$ can be either chosen by locally
trained shadow models (Shokri et al., 2017) or via the optimal Bayesian
strategy (Sablayrolles et al., 2019).
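

The decision rule in (7) amounts to a single comparison per sample. A
minimal sketch, assuming the model outputs class probabilities and the
loss is cross-entropy (the function names and the toy numbers are
illustrative only):

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Per-sample cross-entropy loss from predicted class probabilities."""
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def loss_based_mia(probs, labels, tau):
    """M(x) = 1(loss(x) < tau): predict 'member' when the loss is below tau."""
    return (cross_entropy(probs, labels) < tau).astype(int)

# Made-up predictions for three samples with true labels 0, 0 and 1.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.20, 0.80]])
labels = np.array([0, 0, 1])
decisions = loss_based_mia(probs, labels, tau=0.3)
```

The confidently classified first and third samples fall below the
threshold and are flagged as members, while the second one is not.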


Likelihood-based MIA. Recent works (Carlini et al., 2022; Rezaei &
Liu, 2021) pointed out that the evaluation of MIA should include the
False Positive Rate (FPR) instead of averaged metrics (e.g., attack
accuracy, Area Under Curve (AUC) score of the Receiver Operating
Characteristic (ROC) curve), because MIA is a real threat only if it
succeeds at a low FPR (i.e., few non-members are wrongly inferred as
members). Moreover, Carlini et al. also discovered that loss-based MIAs
are hardly effective under the constraint of low FPR (e.g., FPR <
0.1%). Hence, they devised a more advanced MIA, the Likelihood Ratio
Attack (LiRA), based on the difference in model output caused by the
membership of an input. We consider the online LiRA attack because of
its high attack performance. In particular, the adversary first
prepares shadow models ahead of the attack by sampling N sub-datasets
and training a shadow model on each of them. Hence, for each data
sample, there are N/2 shadow models that are trained on it (called IN
models) and the remaining N/2 that are not trained on it (called OUT
models). The adversary then measures the means $\mu_{in}$, $\mu_{out}$
and the variances $\sigma^2_{in}$, $\sigma^2_{out}$ of model confidence
for IN and OUT models, respectively. Here, the confidence of model f
for (x, y) is $\phi(f(x)_y) = \phi(\exp(-\ell(f(x), y)))$, where $\ell$
is the cross-entropy loss and $\phi(p) = \log\left(\frac{p}{1-p}\right)$.
To attack, the adversary queries the victim model f with a target
example (x, y) to estimate the likelihood ratio $\Lambda$ defined as:


$$\Lambda = \frac{p(\mathrm{conf}_{obs} \mid \mathcal{N}(\mu_{in}, \sigma^2_{in}))}{p(\mathrm{conf}_{obs} \mid \mathcal{N}(\mu_{out}, \sigma^2_{out}))}, \tag{8}$$


where $\mathrm{conf}_{obs} = \phi(f(x)_y)$ is the confidence of the
victim model f on the target example (x, y). The adversary infers
membership by thresholding the likelihood ratio $\Lambda$ with a
threshold $\tau$ determined in advance.
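

The likelihood ratio of the online attack can be computed from the
shadow-model confidences as follows. This is a hedged sketch rather
than the reference LiRA code: the function names and the variance floor
are assumptions, the shadow confidences are made-up numbers, and the
ratio is evaluated in log space for numerical stability.

```python
import numpy as np

def logit_confidence(p_true, eps=1e-12):
    """phi(p) = log(p / (1 - p)) applied to the probability of the true label."""
    p = np.clip(p_true, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def lira_log_ratio(conf_obs, conf_in, conf_out, var_floor=1e-8):
    """log of the ratio p(conf_obs | N(mu_in, var_in)) / p(conf_obs | N(mu_out, var_out))."""
    mu_in, var_in = conf_in.mean(), conf_in.var() + var_floor
    mu_out, var_out = conf_out.mean(), conf_out.var() + var_floor
    return gaussian_logpdf(conf_obs, mu_in, var_in) - gaussian_logpdf(conf_obs, mu_out, var_out)

# Made-up true-label probabilities from four IN and four OUT shadow models.
conf_in = logit_confidence(np.array([0.97, 0.95, 0.99, 0.96]))
conf_out = logit_confidence(np.array([0.60, 0.70, 0.55, 0.65]))

ratio_high_conf = lira_log_ratio(logit_confidence(0.96), conf_in, conf_out)
ratio_low_conf = lira_log_ratio(logit_confidence(0.62), conf_in, conf_out)
```

A positive log ratio means the observed confidence is better explained
by the IN models; the adversary would compare it with the logarithm of
its chosen threshold.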


3. Problem Statement 


In practice, companies may utilize personal data for model training in
order to provide better services. For example, data holders (e.g.,
smart retail stores, smart city facilities) may capture clients' data
and send them to cloud servers for model training. However, models
trained on the raw data (i.e., T) can be attacked by MIA. In addition,
transmitting raw data to servers suffers from potential data leakage
(e.g., to honest-but-curious operators). Therefore, a better protocol
is to first learn knowledge from data by, for instance, generating a
synthetic dataset S from the raw data (i.e., T), and then send S to the
server for model training for downstream applications. Formally, we
define the threat model as follows:


Adversary Goal. The adversary aims to examine the membership
information of the target dataset T. Specifically, for a sample of
interest x, the adversary infers whether x ∈ T.


Adversary Knowledge. We assume a strong adversary (e.g., an
honest-but-curious server) who has no access to T but has white-box
access to both the synthetic dataset S synthesized from the target
dataset T and the model f_S trained on the synthetic dataset. The
adversary also knows the data distribution of T.


Adversary Capacity. The adversary has unlimited computational power to
generate shadow synthetic datasets from data of the same distribution
as T and to train shadow models on them.


Note that white-box access to the model parameters does not help MIA
(Sablayrolles et al., 2019), so we omit other advantages brought by the
white-box access to f_S.


4. Theoretical Analysis 


In this section, we theoretically analyse the relationship between the
target dataset T and the synthetic dataset S of DM (improved DC) (Zhao
& Bilen, 2021a), and the privacy guarantees for T that are thereby
provided. We choose DM because of its high condensation efficiency and
utility for model training. We also compare DM with other DC methods in
terms of the synthetic data distribution (see Appendix E.3), which
indicates that our theoretical results can be generalized to other
methods to some extent. Theoretical analysis of other DC methods (e.g.,
DSA) is left as future work.


Overview. We briefly present an overview of the analysis, which
consists of three parts. First, we clarify the assumptions and
notations in Section 4.1. Then, in Section 4.2, we analyse the
connection between synthetic and original datasets for different DC
initializations. Finally, in Section 4.3, using the conclusions of
Section 4.2, we study the privacy loss of models trained on
DC-synthesized data in a DP manner: how does removing one sample from
the original dataset impact models trained on the synthetic dataset?
Because of the randomness in model training, we build on the model
parameter distribution assumption from (Sablayrolles et al., 2019) and
compute the order of magnitude of the impact, which establishes the
connection between DC and DP.


4.1. Assumptions & Notations 


The DM loss (6) can be optimized for each class (Zhao & Bilen, 2021a).
To simplify the notations, we consider only one class and omit the
label vectors in the synthetic dataset
$\mathcal{S} = \{s_1, \cdots, s_{|\mathcal{S}|}\}$ and the target
dataset $\mathcal{T} = \{x_1, \cdots, x_{|\mathcal{T}|}\}$. We consider
the following two assumptions on the target dataset and the
convergence of DM.


Assumption 4.1. The linear span of the target dataset
$\mathrm{span}(\mathcal{T})$ satisfies
$d_{\mathcal{T}} = \dim(\mathrm{span}(\mathcal{T})) < d$, where d is
the data dimension, $\dim(V)$ represents the dimension of vector space
V, and $\mathrm{span}(\mathcal{T})$ is the vector subspace generated by
all linear combinations of T:


$$\mathrm{span}(\mathcal{T}) := \left\{ \sum_{i=1}^{|\mathcal{T}|} w_i x_i \;\middle|\; 1 \le i \le |\mathcal{T}|,\ w_i \in \mathbb{R},\ x_i \in \mathcal{T} \right\}. \tag{9}$$


In practice, $d_{\mathcal{T}}$ can be computed as the rank of the
matrix form of T. This assumption generally holds for high dimensional
data and can be directly verified for common datasets (e.g.,
CIFAR-10). Without loss of generality, we consider an orthogonal basis
(under the inner product of $\mathbb{R}^d$)
$\mathcal{E} = \{e_1, \cdots, e_d\}$ among which the first
$d_{\mathcal{T}}$ basis vectors
$\mathcal{E}_{\mathcal{T}} = \{e_1, \cdots, e_{d_{\mathcal{T}}}\}$ form
the orthogonal basis of the vector subspace
$\mathrm{span}(\mathcal{T})$.
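

Checking Assumption 4.1 via the matrix rank takes one NumPy call. The
sketch below uses toy data confined to a 10-dimensional subspace of a
64-dimensional space; the sizes and the helper name `span_dimension`
are illustrative assumptions, and a real dataset would be flattened
into the same |T| x d matrix layout.

```python
import numpy as np

def span_dimension(T):
    """d_T = dim(span(T)), computed as the rank of the |T| x d data matrix."""
    return int(np.linalg.matrix_rank(T))

rng = np.random.default_rng(0)
d, latent_dim, n = 64, 10, 500
basis = rng.normal(size=(latent_dim, d))
T = rng.normal(size=(n, latent_dim)) @ basis  # toy data inside a 10-dim subspace

d_T = span_dimension(T)
assumption_holds = d_T < d
```

`matrix_rank` relies on a singular value threshold, so for real images
with sensor noise the numerical rank depends on the tolerance passed to
it.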

Assumption 4.2. [Convergence of DM]. We assume there exists at least
one synthetic dataset
$\mathcal{S}^* = \{s_1^*, \cdots, s_{|\mathcal{S}|}^*\}$ that minimizes
(6).


4.2. Analysis of Synthetic Data 
We first analyse the synthetic data obtained with linear extractors
and then discuss the generalization to the non-linear case (Remark
4.5).


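Proposition 4.3 (Minimizer of DM Loss). For a linear extractor
$\psi_\theta : \mathbb{R}^d \to \mathbb{R}^k$ such that $k < d$,
$\theta \in \mathbb{R}^{k \times d}$, under Assumptions 4.1 and 4.2,
the dataset $\mathcal{S}^*$ synthesized by DM from the target dataset T
satisfies:


1) The barycenters of $\mathcal{S}^*$ and T coincide:


$$\frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} x_i = \frac{1}{|\mathcal{S}^*|} \sum_{i=1}^{|\mathcal{S}^*|} s_i^*. \tag{10}$$


2) $\forall s_i^* \in \mathcal{S}^*$,
$s_i^* = s^*_{i,\mathcal{E}_{\mathcal{T}}} + s^*_{i,\mathcal{E}_{\mathcal{T}}^{\perp}}$,
where $s^*_{i,\mathcal{E}_{\mathcal{T}}} \in \mathrm{span}(\mathcal{T})$
and $s^*_{i,\mathcal{E}_{\mathcal{T}}^{\perp}} \in (\mathrm{span}(\mathcal{T}))^{\perp}$
satisfies


$$\sum_{i=1}^{|\mathcal{S}^*|} s^*_{i,\mathcal{E}_{\mathcal{T}}^{\perp}} = 0_{\mathbb{R}^d}. \tag{11}$$


Privacy for Free: How does Dataset Condensation Help Privacy?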

The proof of Proposition 4.3 can be found in Appendix A. Note that the
minimizer is
$s_1^* = \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} x_i$ when
$|\mathcal{S}^*| = 1$, indicating that the synthetic data falls into
the vector subspace $\mathrm{span}(\mathcal{T})$, which is consistent
with Proposition 4.3.


The synthetic dataset in DC can be initialized with either real data
sampled from T or random noise. Next, we study the impact of DM
initialization and obtain the following results (the proof can be
found in Appendix B).


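Proposition 4.4 (Connection between $\mathcal{S}^*$ and T). Based on
Proposition 4.3, the minimizer synthetic dataset
$\mathcal{S}^* = \{s_1^*, \cdots, s_{|\mathcal{S}|}^*\}$ has the
following properties for different initialization strategies:


1) Real data initialization. Assume that S is initialized with the
first $|\mathcal{S}|$ samples of T, i.e., $s_i = x_i$, then we have


$$s_i^* = x_i + \frac{1}{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{T}|} x_j - \frac{1}{|\mathcal{S}|} \sum_{j=1}^{|\mathcal{S}|} x_j. \tag{12}$$


2) Random initialization. The synthetic data are initialized with
noise of normal distribution, i.e.,
$\forall s_i \in \mathcal{S}, s_i \sim \mathcal{N}(0, I_d)$, and we
assume the empirical mean is zero, i.e.,
$\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} s_i = 0$, then we
have


$$s_i^* = s^*_{i,\mathcal{E}_{\mathcal{T}}} + s^*_{i,\mathcal{E}_{\mathcal{T}}^{\perp}}, \tag{13}$$


where
$s^*_{i,\mathcal{E}_{\mathcal{T}}} = s_{i,\mathcal{E}_{\mathcal{T}}} + \frac{1}{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{T}|} x_j \in \mathrm{span}(\mathcal{T})$,
and
$s^*_{i,\mathcal{E}_{\mathcal{T}}^{\perp}} = s_{i,\mathcal{E}_{\mathcal{T}}^{\perp}} \in (\mathrm{span}(\mathcal{T}))^{\perp}$.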
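

The barycenter behaviour stated in Propositions 4.3 and 4.4 can be
reproduced numerically in a toy setting. The sketch below is an
illustration under assumptions, not the procedure of Appendix A or B:
it runs plain stochastic gradient descent on the DM loss with a fresh
random linear extractor at every step, without augmentation, and the
sizes and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_real, n_syn = 8, 4, 200, 5
T = rng.normal(loc=1.0, size=(n_real, d))
S0 = T[:n_syn].copy()   # real data initialization: first |S| samples of T
S = S0.copy()

lr = 0.02
for _ in range(2000):
    theta = rng.normal(size=(k, d))          # fresh random linear extractor
    gap = T.mean(axis=0) - S.mean(axis=0)    # difference of the two barycenters
    # The gradient of ||theta @ gap||^2 w.r.t. each s_j is -(2/|S|) theta^T theta gap,
    # so one descent step moves every synthetic point by the same vector.
    step = (2.0 / n_syn) * theta.T @ theta @ gap
    S += lr * step

initial_gap = T.mean(axis=0) - S0.mean(axis=0)
barycenters_match = np.allclose(S.mean(axis=0), T.mean(axis=0), atol=1e-8)
uniform_shift = np.allclose(S - S0, initial_gap, atol=1e-8)
```

After optimization the synthetic barycenter coincides with that of T,
and every synthetic point has moved by the initial barycenter gap, so
each point stays close to the real sample it was initialized from.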


Remark 4.5 (Non-linear Extractor). Our results can be generalized to
non-linear extractors. Giryes et al. proved that multi-layer random
neural networks generate distance-preserving embeddings of input data,
so (6) is minimized if and only if the distance between real and
synthetic data is minimized. Take 2-layer random networks as an example
(Estrach et al., 2014): there exist two constants $0 < A \le B$ such
that $\forall (x, y) \in (\mathbb{R}^d)^2$,
$A \|x - y\|_2 \le \|\rho(\theta x) - \rho(\theta y)\|_2 \le B \|x - y\|_2$,
where $\rho$ is ReLU. We also analyse the case of 2-layer extractors
(activated by ReLU) and find that the (pseudo-)barycenters of
$\mathcal{S}^*$ and T still coincide. Moreover, on convolutional
extractors and the 2-layer extractors, we empirically verify
Proposition 4.3 in Appendix D (see Figure 7).


Remark 4.6 (Impact of initialization on privacy). Note that in the
case of real data initialization, a higher |S| results in a lower
distance between the barycenters of the initialized S and T, thus the
changes brought to S become smaller when S becomes larger. This
explains the phenomenon that DM-generated images are more visually
similar to the real images for higher ipc (images per class) (Zhao &
Bilen, 2021a). However, as we demonstrate in our experiments (Section
5.2), the membership of data used for DC initialization can still be
inferred by the vanilla loss-based MIA. One countermeasure is to choose
hard-to-infer samples (Carlini et al., 2022), i.e., samples whose model
outputs are not affected by the membership, as initialization data.


On the other hand, data not used for initialization have little effect
on the synthetic data (i.e., their weights in the synthetic data are
$O(1/|\mathcal{T}|)$), just as in the case of random initialization,
where data in T also have little effect on $\mathcal{S}^*$. We
demonstrate that the membership of those data cannot be inferred under
both the loss-based MIA and the state-of-the-art likelihood-based MIA
(see Section 5.2). Moreover, the projection component in
$(\mathrm{span}(\mathcal{T}))^{\perp}$ can further protect the privacy
(e.g., visual privacy in Section 5.4).

Remark 4.7 (Comparison between DC and GAN). The gen- 
erator of GAN is trained to minimize the distance between 
the real and the generated data distributions, which is simi- 
lar to the objective of DC. However, GAN-generated data 
share the same constraints (i.e., bounded between 0 and 1) 
as the real data. DC-generated data do not need to satisfy 
these constraints. This enables the DC-generated data to 
contain more features and explains the higher accuracy of 
models trained on DC-generated data (Zhao et al., 2021). 
We also empirically compare the accuracy of models trained on
GAN-generated data and DC-synthesized data (see Section 5.3), and find
that DC-synthesized data outperform GAN-generated data, training better
models with a smaller amount of training data.


4.3. Privacy Bound of Models Trained on Synthetic 
Data 


To understand how the synthetic dataset protects the membership
privacy of T when being used for training model f_S, we estimate how
the model parameters change when removing one sample from T by adopting
the assumption below. With a slight abuse of notation, we denote the
minimizer set $\mathcal{S}^*$ by S when the context is clear.


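Assumption 4.8 (Distribution of model parameter (Sablayrolles et al.,
2019)). The distribution of model parameter $\theta$ given training
dataset $\mathcal{T} = \{x_1, \cdots, x_{|\mathcal{T}|}\}$ and loss
function $\ell$ is:

$$P(\theta \mid \mathcal{T}) = \frac{1}{K_{\mathcal{T}}} \exp\left(-\sum_{i=1}^{|\mathcal{T}|} \ell(\theta, x_i)\right), \tag{14}$$


where $K_{\mathcal{T}}$ is the constant normalizing the distribution.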


Unlike widely known DP mechanisms (e.g., the Gaussian mechanism) that
transform the deterministic query function into a randomized one,
randomness brought by the optimization algorithm (i.e., SGD) or
hardware defaults leads to different parameters each time a model is
trained, which justifies Assumption 4.8 and ensures the "uncertainty"
in DP. In addition, we need the following assumption on the datasets S
and T and the loss function $\ell$ introduced in Assumption 4.8. The
assumption is valid for finite datasets and common loss functions
(e.g., cross-entropy) and is used to quantify the data bound and loss
variation.


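Assumption 4.9. We assume the data of T and S are bounded, i.e.,


$$\exists B > 0, \forall x \in \mathcal{T} \cup \mathcal{S}, \|x\|_2 \le B. \tag{15}$$


The loss function $\ell(\theta, \cdot) : \mathbb{R}^d \to \mathbb{R}^+$
is L-Lipschitz according to the $L_2$ norm, i.e.,


$$\forall (x, y) \in \mathcal{B}(B)^2, \forall \theta, \quad |\ell(\theta, x) - \ell(\theta, y)| \le L \|x - y\|_2, \tag{16}$$

where $\mathcal{B}(B) = \{x \mid \|x\|_2 \le B\}$ is the closed ball of
$\mathbb{R}^d$.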


With all previous assumptions, we have the following result. 


Proposition 4.10. Suppose a target dataset
$\mathcal{T} = \{x_1, \cdots, x_{|\mathcal{T}|}\}$ and the
leave-one-out dataset $\mathcal{T}' = \mathcal{T} \setminus \{x\}$ such
that x is not used for initialization. The synthetic datasets are
$\mathcal{S}$ and $\mathcal{S}'$, with
$|\mathcal{S}| = |\mathcal{S}'| \ll |\mathcal{T}|$. Denote the model
parameter distributions of $\mathcal{S}$ and $\mathcal{S}'$ by
$p(\theta) = P(\theta \mid \mathcal{S})$ and
$q(\theta) = P(\theta \mid \mathcal{S}')$ respectively. Then, the
membership privacy leakage caused by removing x is


$$D_{KL}(p \,\|\, q) = O\left(\frac{|\mathcal{S}|}{|\mathcal{T}|}\right). \tag{17}$$


Proposition 4.10 indicates that the adversary can only obtain limited
information (i.e., $O(|\mathcal{S}|/|\mathcal{T}|)$) by MIA when the
synthetic data is much fewer than the original data
($|\mathcal{S}| \ll |\mathcal{T}|$), which explains why the synthetic
data S protects the membership privacy of model f_S. The proof is in
Appendix C.
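

To give some intuition for where the |S|/|T| order comes from, the
following sketch measures how far the barycenter of T moves when one
sample is removed. It is a toy illustration, not the proof of Appendix
C, and it rests on assumptions: in the linear-extractor setting the
minimizer depends on T only through its barycenter, so every synthetic
point moves by the barycenter change, and an L-Lipschitz loss summed
over S then changes by at most |S| times L times that shift. The
constants `B`, `lipschitz` and `n_syn` are arbitrary.

```python
import numpy as np

def loo_shift(T, idx):
    """Norm of the barycenter change when sample idx is removed from T."""
    mean_full = T.mean(axis=0)
    mean_loo = np.delete(T, idx, axis=0).mean(axis=0)
    return float(np.linalg.norm(mean_full - mean_loo))

rng = np.random.default_rng(0)
B, d, n_syn, lipschitz = 1.0, 16, 10, 1.0   # illustrative constants (assumptions)
bounds = {}
for n_real in (100, 1000, 10000):
    T = rng.uniform(-1.0, 1.0, size=(n_real, d))
    T /= np.maximum(np.linalg.norm(T, axis=1, keepdims=True), B)  # enforce ||x||_2 <= B
    shift = loo_shift(T, idx=0)
    # Each of the |S| synthetic points moves by `shift`, so the summed
    # L-Lipschitz loss changes by at most |S| * L * shift.
    bounds[n_real] = n_syn * lipschitz * shift
```

The barycenter shift is at most 2B/|T|, so the quantity stored in
`bounds` shrinks proportionally to |S|/|T| as the target dataset grows.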


Connection to DP. Our privacy analysis is based on the impact on the
model parameter distribution of removing one element from the original
training dataset, which is similar to the definition of DP (Dwork et
al., 2006). Formally, an $\epsilon$-differentially private mechanism
$\mathcal{M}$ satisfies:


$$\ln \frac{P(\mathcal{M}(D) \in S_{\mathcal{M}})}{P(\mathcal{M}(D') \in S_{\mathcal{M}})} \le \epsilon \tag{18}$$

for all neighboring dataset pairs $(D, D')$ and all subsets
$S_{\mathcal{M}}$ of the range of $\mathcal{M}$. Without knowledge of
the explicit form of the model parameter distribution, we can only
claim that the privacy budget $\epsilon$ varies at the order of
$O(|\mathcal{S}|/|\mathcal{T}|)$.

In practice, we use an empirical budget $\hat{\epsilon}$ obtained
through MIA (Kairouz et al., 2015) to measure the privacy guarantee
against MIA. Typically, for an MIA that achieves a given FPR and TPR
(True Positive Rate) against a model, the empirical budget is
$\hat{\epsilon} \ge \ln(\mathrm{TPR}/\mathrm{FPR})$. In other words,
the model behaves as $\hat{\epsilon}$-differentially private to an
adversary that applies MIA (i.e., the threat model in Section 3).
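

The bound on the empirical budget is a one-line computation from a
single operating point of the attack. A minimal sketch, in which the
function name and the TPR/FPR values are made up for illustration and
are not results from the paper:

```python
import math

def empirical_epsilon(tpr, fpr):
    """Lower bound ln(TPR / FPR) on the empirical budget for one operating point."""
    if not (0.0 < fpr <= 1.0 and 0.0 < tpr <= 1.0):
        raise ValueError("TPR and FPR must lie in (0, 1]")
    return math.log(tpr / fpr)

# Hypothetical attack achieving a 1% TPR at a 0.1% FPR.
eps_hat = empirical_epsilon(tpr=0.01, fpr=0.001)
# An attack no better than chance (TPR equal to FPR) gives a bound of zero.
eps_chance = empirical_epsilon(tpr=0.001, fpr=0.001)
```

In an evaluation one would take the maximum of this quantity over the
operating points of interest, for instance the low-FPR region of the
ROC curve.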


Note that the empirical budget $\hat{\epsilon}$ is not equivalent to
the real budget $\epsilon$ because of different threat models (Nasr et
al., 2021). Nonetheless, we consider black-box MIA as the only privacy
threat to the model, thus we can regard the DP budget $\epsilon$ as a
model privacy metric against MIA. In this way, we can compare
$\hat{\epsilon}$ and $\epsilon$ by the definition of $\hat{\epsilon}$.
In Section 5.3, we show that models trained on data synthesized by DC
achieve $\hat{\epsilon} \approx 2$ against the threat from the
state-of-the-art MIA (LiRA), and obtain accuracy much higher than
differentially private generators (Chen et al., 2020a; Harder et al.,
2021), indicating that DC is a better option for efficient and
privacy-preserving model training.


5. Evaluation 


In this section, we evaluate the membership privacy of f_S for real
data initialization and random initialization. Then, we compare DC with
previous DP-generators and GAN to demonstrate DC's better trade-off
between privacy and utility. Finally, we investigate the visual privacy
of DC-synthesized data.


5.1. Experimental Setup 


Datasets & Architectures. We use three datasets: FashionMNIST (Xiao et
al., 2017), CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al.,
2015) for gender classification. The CelebA images are center cropped
to dimension 64 × 64, and we randomly sample 5,000 images for each
class, which is the same as CIFAR-10, while FashionMNIST contains 6,000
images for each class. We adopt the same 3-layer convolutional neural
networks used in (Zhao & Bilen, 2021a) and (Nguyen et al., 2021b) as
the feature extractor.


DC Settings. One important hyperparameter of DSA, DM and KIP is the
ratio of images per class $r_{ipc} = |\mathcal{S}| / |\mathcal{T}|$. We
evaluate $r_{ipc} = 0.002, 0.01$ for all methods, and for DM we add an
extra evaluation with $r_{ipc} = 0.02$ due to its high efficiency in
producing large synthetic sets. Note that $r_{ipc}$ influences the
model training efficiency: the lower $r_{ipc}$, the faster the model
training. We also consider ZCA preprocessing for KIP as it is reported
to be effective for improving KIP performance. Appendix E.1 contains
more DC implementation details.


Baselines. As non-private baselines, we adopt a subset sampled from T
(this baseline is termed real data) and data generated by a conditional
GAN (cGAN, or GAN for short) (Mirza & Osindero, 2014) which is trained
on T. As private baselines, we choose DP-generators including GS-WGAN
(Chen et al., 2020a), DP-MERF (Harder et al., 2021) and DP-Sinkhorn
(Cao et al., 2021). We compare the DC methods with the baselines in
terms of privacy and efficiency.


MIA Settings & Attack Metrics. For each dataset, we randomly split it
into two subsets with an equal number of samples and choose one subset
as T (member data). We then synthesize the dataset S from T, and train
a model f_S (victim model) on S. The other subset becomes the
non-member data used for testing the MIA performance. The above process
is called preparation of synthetic dataset.


For loss-based MIA, we repeat the preparation of synthetic dataset 10
times with different random seeds. This gives us 10 groups of T, S and
f_S. For each f_S, we first select N member samples from T and N
non-member samples, and choose an optimal threshold that maximizes the
advantage score on these 2N samples (Sablayrolles et al., 2019). The
threshold is then tested on another disjoint set of 2N samples composed
of N member samples and N non-member samples to compute the advantage
score of loss-based MIA. We report the advantage (in percentage)
defined as 2 × (acc − 50%), where acc is the test accuracy of
membership in percentage.
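

The calibrate-then-test protocol for the advantage score can be
sketched as follows. The losses here are synthetic draws, not
measurements from any model, and the function names are hypothetical;
the sketch only illustrates choosing the threshold on one set of 2N
samples and reporting the advantage on a disjoint set.

```python
import numpy as np

def advantage(losses, is_member, tau):
    """Advantage in percent: 2 * (acc - 50), with acc the membership accuracy in percent."""
    pred = losses < tau
    acc = 100.0 * float((pred == is_member).mean())
    return 2.0 * (acc - 50.0)

def best_threshold(losses, is_member):
    """Pick the threshold that maximizes the advantage on the calibration samples."""
    candidates = np.concatenate([np.sort(losses), [np.inf]])
    scores = [advantage(losses, is_member, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

rng = np.random.default_rng(0)
N = 500
# Synthetic losses (made up): members tend to have lower loss than non-members.
losses = np.concatenate([rng.exponential(0.2, 2 * N), rng.exponential(1.0, 2 * N)])
is_member = np.concatenate([np.ones(2 * N, dtype=bool), np.zeros(2 * N, dtype=bool)])

calib = np.concatenate([np.arange(N), np.arange(2 * N, 3 * N)])     # N members + N non-members
test = np.concatenate([np.arange(N, 2 * N), np.arange(3 * N, 4 * N)])  # disjoint 2N samples

tau = best_threshold(losses[calib], is_member[calib])
adv_test = advantage(losses[test], is_member[test], tau)
```

An advantage of 0 corresponds to random guessing and 100 to perfect
membership inference on the balanced test set.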


For LiRA (Carlini et al., 2022), we repeat the preparation of
synthetic dataset $N_m$ times with different random seeds, and obtain
$N_m$ shadow T, S and f_S. We set $N_m = 256$ for DM and $N_m = 64$ for
KIP because of its lower training efficiency. We omit DSA for LiRA due
to its longer training time. To attack a victim model, we compute the
likelihoods of each sample with the $N_m$ shadow f_S and determine the
threshold of likelihood according to the required false positive rate.
We use the Receiver Operating Characteristic (ROC) curve and Area Under
Curve (AUC) score to evaluate the attack performance. Note that we
adopt the strongest (and unrealistic) attack assumption (i.e., the
attacker knows the membership), so that we investigate the privacy of
DC-synthesized data under the worst case.
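

Choosing the likelihood threshold from a false positive requirement
and reading off the corresponding true positive rate can be sketched as
below. This is a simplified illustration with made-up scores, and
`tpr_at_fpr` is a hypothetical helper rather than part of any attack
library.

```python
import numpy as np

def tpr_at_fpr(scores, is_member, target_fpr):
    """TPR of the rule 'score > threshold' with the threshold set so that FPR <= target_fpr."""
    non_member_scores = np.sort(scores[~is_member])[::-1]  # descending order
    n_allowed = int(np.floor(target_fpr * len(non_member_scores)))
    # Flagging scores strictly above the (n_allowed + 1)-th largest non-member
    # score yields at most n_allowed false positives.
    if n_allowed < len(non_member_scores):
        threshold = non_member_scores[n_allowed]
    else:
        threshold = -np.inf
    return float((scores[is_member] > threshold).mean())

# Made-up attack scores: 100 non-members and 4 members.
non_members = np.arange(100) / 100.0
members = np.array([0.5, 0.985, 1.5, 2.0])
scores = np.concatenate([members, non_members])
is_member = np.concatenate([np.ones(4, dtype=bool), np.zeros(100, dtype=bool)])

tpr = tpr_at_fpr(scores, is_member, target_fpr=0.01)
```

Sweeping `target_fpr` over a grid traces out the ROC curve, and the
low-FPR end of that curve is the regime emphasized for LiRA.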


5.2. Membership Privacy of f_S


Table 1. Advantage (%) of loss-based MIA against models trained on
real data (baseline) and data synthesized by DSA, DM and KIP with real
data initialization.

Method                         r_ipc   FashionMNIST     CIFAR-10         CelebA
Real (baseline, non-private)   0.002   46.67 ± 16.33    72.00 ± 24.00    100.00 ± 0.00
                               0.01    21.00 ± 3.67     92.80 ± 5.31     84.00 ± 5.06
                               0.02    17.33 ± 2.91     82.60 ± 5.59     77.00 ± 6.71
DM                             0.002   78.17 ± 3.20     49.80 ± 5.83     37.00 ± 12.69
                               0.01    83.67 ± 2.77     64.20 ± 4.77     47.00 ± 19.52
                               0.02    83.00 ± 2.56     68.20 ± 7.35     53.00 ± 14.18
DSA                            0.002   74.40 ± 2.65     55.40 ± 8.20     30.50 ± 8.16
                               0.01    81.60 ± 2.27     56.60 ± 2.95     28.00 ± 3.74
KIP (w/o ZCA)                  0.002   67.83 ± 4.54     42.40 ± 4.80     23.00 ± 11.87
                               0.01    70.00 ± 2.47     51.40 ± 5.73     25.00 ± 15.65
KIP (w/ ZCA)                   0.002   67.67 ± 4.42     50.40 ± 5.35     23.00 ± 15.52
                               0.01    64.00 ± 4.23     48.40 ± 6.62     17.00 ± 18.47


Real Data Initialization Leaks Membership Privacy. We begin with the membership privacy leaked by $f_{\mathcal{S}}$, as mentioned in Remark 4.6 of Proposition 4.4. We aim to verify that DC with real data initialization still leaks the membership privacy of the data used for initialization. Here, the data used for initialization are sampled from $\mathcal{T}$ each time the synthetic dataset is prepared. We launch loss-based MIA against $f_{\mathcal{S}}$ and adopt the real data baseline.


Table 1 shows the advantage score of loss-based MIA. Here, we slightly vary the attack setting: the advantage scores are computed with the data used for real initialization and the same amount of member data not used for initialization but involved in DC. We observe that, on CIFAR-10 and CelebA, the synthetic dataset with real data initialization achieves lower advantage scores compared to directly using real data for training (baseline). This can be explained by Proposition 4.4, which tells us that the synthesized data deviate slightly from the real data used for initialization. However, on FashionMNIST, the baseline has lower advantage scores. We suspect this is because FashionMNIST images are grayscale and the synthetic data contain more features that are prone to being memorized by networks. The loss distribution in Figure 8 in Appendix E.2 also demonstrates that synthetic data with real data initialization can still leak membership privacy.


Next, we show that models trained on data synthesized by 
DC with random initialization are robust against both loss- 
based MIA and LiRA (the state-of-the-art MIA). 


MIA Robustness of Random Initialization. Because of random initialization, each member sample contributes equally to the synthetic dataset. Thus, in this case, we follow the loss-based attack setting and set $N = 1000$. Table 2 provides the average and the standard deviation of the advantages for models trained on synthetic datasets produced by cGAN, DSA, DM and KIP. The advantages are around 0 for all $r_{ipc}$, signifying that the adversary cannot infer the membership of $\mathcal{T}$. Nevertheless, as long as the adversary has access to the generated images (which is included in our threat model), the membership of the GAN generator's training data (i.e., $\mathcal{T}$) can still be leaked (see Appendix E.4). Meanwhile, as we show later in Figure 3, models trained on DC-synthesized data achieve higher accuracy scores than the baseline (i.e., cGAN-generated data), demonstrating the higher utility of DC-synthesized data.


Table 2. Advantage (%) of loss-based MIA against models trained on data synthesized by cGAN (baseline), DSA, DM and KIP with random initialization.

Methods                         $r_{ipc}$   FashionMNIST     CIFAR-10        CelebA
cGAN (baseline, non-private)    0.002       .29 ± 0.89       0.44 ± 1.88     −0.57 ± 0.97
                                0.01        .18 ± 1.21       0.58 ± 2.09     −0.81 ± 0.95
                                0.02        .04 ± 0.70       0.77 ± 1.59     −0.47 ± 1.22
DM                              0.002       −0.34 ± 0.42     0.31 ± 1.93     0.66 ± 1.44
                                0.01        −0.29 ± 0.48     1.06 ± 1.20     0.56 ± 1.52
                                0.02        18 ± 0.53        0.72 ± 0.70     0.67 ± 1.18
DSA                             0.002       .09 ± 0.51       0.39 ± 1.04     0.39 ± 1.90
                                0.01        .52 ± 0.55       1.27 ± 1.71     −1.16 ± 0.90
KIP (w/o ZCA)                   0.002       −1.13 ± 1.84     0.25 ± 1.20     0.56 ± 1.07
                                0.01        −0.95 ± 0.96     0.25 ± 1.80     −1.51 ± 0.69
KIP (w/ ZCA)                    0.002       0.56 ± 2.02      0.64 ± 1.86     −1.06 ± 1.10
                                0.01        1.69 ± 1.96      0.22 ± 1.27     −1.80 ± 1.91

LiRA is a more powerful MIA because it can achieve higher TPR at low FPR (Carlini et al., 2022), while the adversary's computational cost is higher. Figure 2 provides the ROC curves of LiRA against $f_{\mathcal{S}}$. We can observe that the ROC curves are close to the diagonal (red line) for all datasets and $r_{ipc}$. The AUC scores of the ROC curves are around 0.5, indicating that the attacker gains negligible benefit (low TPR) over random guessing. Recall that LiRA is evaluated on the whole dataset (half as members and the other half as non-members), so the minimum FPR value is $\frac{1}{5000} = 2 \times 10^{-4}$ for CelebA, $4 \times 10^{-5}$ for CIFAR-10 and $3.33 \times 10^{-5}$ for FashionMNIST. Therefore, when FPR is close to 0, the ROC curves have different shapes for different datasets. However, we also notice that the TPR is close to the FPR when the FPR is around the minimum (e.g., FPR $\approx 10^{-4}$ for CelebA), demonstrating that models trained on data synthesized by DC with random initialization are robust to LiRA at low FPR.
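The minimum FPR values quoted for LiRA follow from the number of non-members in each evaluation: with $n$ non-members, the smallest non-zero FPR is $1/n$. The sketch below reproduces this arithmetic and shows one way to read off the TPR at a given FPR budget. The non-member counts 25000 and 30000 are inferred here from the quoted FPR values (an assumption), the function names are hypothetical, and the attack scores are toy numbers.

```python
from typing import Sequence


def min_nonzero_fpr(num_non_members: int) -> float:
    """Smallest non-zero false positive rate: one false alarm among the non-members."""
    return 1.0 / num_non_members


def tpr_at_fpr(member_scores: Sequence[float],
               non_member_scores: Sequence[float],
               max_fpr: float) -> float:
    """Largest TPR over thresholds whose FPR does not exceed max_fpr.

    A sample is predicted 'member' when its score is >= the threshold.
    """
    best = 0.0
    for t in sorted(set(member_scores) | set(non_member_scores)):
        fpr = sum(s >= t for s in non_member_scores) / len(non_member_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Toy, made-up attack scores.
toy_members = [0.9, 0.8, 0.3]
toy_non_members = [0.85, 0.2, 0.1]
```

For an attack that is no better than random guessing, `tpr_at_fpr` stays close to `max_fpr` itself, which is the behaviour reported for the ROC curves near the minimum FPR.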


5.3. Comparison with Different Generators 


Table 3. Utility comparison of datasets synthesized by DP-generators, DM and KIP. The utility is measured by the accuracy (%) of models trained on the synthetic dataset. The results are estimated on FashionMNIST.

Method           DP budget                   $r_{ipc}$ = 0.002   0.01            0.02
GS-WGAN          $\epsilon = 10$             53.53 ± 0.42        51.85 ± 0.54    50.10 ± 0.32
DP-MERF          [row illegible in source]
DP-Sinkhorn      $\epsilon = 10$             -                   -               70.9*
KIP (w/o ZCA)    $\hat{\epsilon} = 1.25$     73.70 ± 1.13        68.11 ± 1.33    -
KIP (w/ ZCA)     $\hat{\epsilon} = 2.07$     74.37 ± 0.96        70.03 ± 0.84    -
DM               $\hat{\epsilon} = 2.30$     80.59 ± 0.62        85.10 ± 0.51    86.13 ± 0.34

* Results reported in the paper (Cao et al., 2021) ($r_{ipc} = 1$).


GAN Generator. Figure 3 compares the accuracy scores of models trained on synthetic datasets (DC-synthesized with random initialization under $r_{ipc} = 0.01$). We find that, under the same constraint on training efficiency (i.e., $r_{ipc} = 0.01$), DM and DSA outperform the other methods. Note that models trained on KIP-synthesized data achieve lower accuracy than the baseline because the loss is hard to converge for large $r_{ipc}$. Nevertheless, for small $r_{ipc}$, KIP significantly outperforms the baselines on CIFAR-10 and CelebA (see Figure 10 in Appendix E).


Then, we aim to know how much DC improves model training efficiency compared to cGAN. In other words, to achieve the same accuracy of $f_{\mathcal{S}}$, the difference between the $r_{ipc}$ that DC requires and the $r_{ipc}$ that cGAN requires can be seen as the model training efficiency that DC improves. For different $r_{ipc}$ (the x-axis), Figure 4 shows the accuracy of models trained on a cGAN-generated dataset whose ratio is $r_{ipc}$ (the blue solid curve). The red and green horizontal lines represent the accuracy of $f_{\mathcal{S}}$ trained on datasets synthesized by DSA and DM for $r_{ipc} = 0.01$, respectively. We omit KIP here because of its lower utility than the baselines. Therefore, the $r_{ipc}$ of the intersection point of the red (resp. green) line and the blue curve is the $r_{ipc}$ of the cGAN-generated dataset on which models can be trained to achieve the same accuracy as DSA (resp. DM). We can see that cGAN needs to generate more data to train a model that achieves the same accuracy as models trained on data synthesized by DM and DSA, because the $r_{ipc}$ values indicated on the x-axis are all higher than 0.01. It is worth noting that DC improves the training efficiency (measured by $r_{ipc}$) by at least a factor of 2 over cGAN for $r_{ipc} = 0.01$, because on FashionMNIST (the leftmost sub-figure in Figure 4), cGAN needs to generate a synthetic dataset of $r_{ipc} = 0.02$ to achieve the same accuracy (0.85) as the DM-synthesized dataset ($r_{ipc} = 0.01$).
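Reading the intersection point off Figure 4 amounts to inverting the cGAN accuracy curve at a target accuracy. The sketch below does this by linear interpolation. The function name is hypothetical, and the curve points are made-up toy values, except that the point (0.02, 0.85) matches the FashionMNIST example in the text.

```python
from typing import List, Tuple


def ratio_needed(curve: List[Tuple[float, float]], target_acc: float) -> float:
    """Smallest r_ipc at which the curve reaches target_acc (linear interpolation).

    curve: (r_ipc, accuracy) pairs sorted by r_ipc, with non-decreasing accuracy.
    """
    for (r0, a0), (r1, a1) in zip(curve, curve[1:]):
        if a0 <= target_acc <= a1:
            if a1 == a0:
                return r0
            return r0 + (target_acc - a0) / (a1 - a0) * (r1 - r0)
    raise ValueError("target accuracy is outside the sampled curve")


# Toy cGAN curve; only the point (0.02, 0.85) comes from the text.
cgan_curve = [(0.01, 0.80), (0.02, 0.85), (0.05, 0.88)]
dc_ratio = 0.01        # r_ipc used by the DC method
dc_accuracy = 0.85     # accuracy reached by DM on FashionMNIST

cgan_ratio = ratio_needed(cgan_curve, dc_accuracy)
efficiency_gain = cgan_ratio / dc_ratio
```

Here `efficiency_gain` is the factor by which cGAN must enlarge its generated dataset to match the DC accuracy, which is the quantity the paragraph reports as "at least a factor of 2".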


DP-generators. We estimate an empirical $\hat{\epsilon}$ based on the ratio of TPR and FPR computed by LiRA. In Table 3, we compare the accuracy of models trained on DC-synthesized data and on data generated by recent DP-generators. We reproduce DP-MERF and GS-WGAN according to the official implementations and adopt the reported results of DP-Sinkhorn. We observe that the accuracy of models trained on data generated by the state-of-the-art DP-generator (DP-Sinkhorn) is still lower than that of models trained on DM-synthesized images, even though the ratio for DP-Sinkhorn is $r_{ipc} = 1$. The reason is that DP is designed to defend against the strongest adversary, who has access to the training process of the generator. Hence, data generated by DP-generators are of lower utility for model training because of this overly strong defense requirement.
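One common way to turn an attack's operating points into an empirical privacy estimate uses the hypothesis-testing view of differential privacy, under which an $(\epsilon, \delta)$-DP mechanism satisfies $\mathrm{TPR} \le e^{\epsilon}\,\mathrm{FPR} + \delta$. The sketch below applies this bound. The exact estimator used for Table 3 is not specified in the text, so the formula, the function name and the toy ROC points are assumptions.

```python
import math
from typing import Iterable, Tuple


def empirical_epsilon(roc_points: Iterable[Tuple[float, float]],
                      delta: float = 0.0) -> float:
    """Largest ln((TPR - delta) / FPR) over the given (FPR, TPR) points.

    Points with FPR == 0 or TPR <= delta are skipped because the log is undefined.
    """
    best = 0.0
    for fpr, tpr in roc_points:
        if fpr > 0.0 and tpr - delta > 0.0:
            best = max(best, math.log((tpr - delta) / fpr))
    return best


# Toy, made-up ROC points (FPR, TPR).
toy_roc = [(0.0, 0.0), (0.001, 0.01), (0.1, 0.2), (1.0, 1.0)]
eps_hat = empirical_epsilon(toy_roc)
```

An attack whose ROC curve lies on the diagonal yields an estimate of 0 under this formula, while a tenfold TPR/FPR ratio yields $\ln 10 \approx 2.30$.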


5.4. Visual Privacy 


The adversary can directly visualize the synthetic data and compare them with the target sample to infer membership. We visualize the synthetic images and use the $L_2$ distance as well as the perceptual metric LPIPS (Zhang et al., 2018) with a VGG backbone to measure the similarity between synthetic and real images. Figure 5 shows examples of DM-generated images and their (top 3) most similar real images, i.e., the images with the lowest $L_2$ and LPIPS distances to the synthetic image at the top of the column. We observe that the real data share similar facial contour patterns with the synthetic images, but more fine-grained features, e.g., eye shape, are different, which explains why models trained on the synthetic dataset protect the membership privacy of the original data. This can also explain why current MIAs fail on models trained on synthetic datasets: the generated synthetic training data have lost the private properties of the real data, and thus the adversary is not able to infer private information from models trained on such synthetic data.

Figure 2. ROC curves of LiRA against models trained on data synthesized by DM (left three figures) and KIP (right three figures). The solid, dashed and dotted lines stand for results of $r_{ipc}$ = 0.002, 0.01 and 0.02, respectively. In the KIP figures, the orange and blue lines represent the results of KIP with and without ZCA preprocessing, respectively. The red diagonal represents random guess and the AUC scores of the ROC curves are all under 0.51.

Figure 3. Accuracy of models trained on data synthesized by DSA, DM, KIP and on data generated by baselines for $r_{ipc} = 0.01$. The x-axis represents the initialization strategy (Random, Real). For real data and random initialization, the baselines are real data and cGAN-generated data, respectively. Legend: Baseline, DSA, DM, KIP (w/o ZCA), KIP (w/ ZCA).

Figure 4. Accuracy of models trained on cGAN-generated data for different $r_{ipc}$. The horizontal lines are the accuracy of models trained on data synthesized by DM (green, dashed) and DSA (red, dash-dotted) for $r_{ipc} = 0.01$.
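The $L_2$ part of the nearest-neighbour search in Section 5.4 can be sketched as below, with images represented as flat lists of pixel values. The function names are hypothetical and the vectors are toy data. LPIPS is omitted because it requires a pretrained network.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flattened images of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(synthetic: Sequence[float],
                  real_images: Sequence[Sequence[float]],
                  k: int = 3) -> List[Tuple[int, float]]:
    """Indices and distances of the k real images closest to the synthetic image."""
    distances = [(i, l2_distance(synthetic, img)) for i, img in enumerate(real_images)]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Toy two-pixel "images".
toy_synthetic = [0.0, 0.0]
toy_real = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
nearest = top_k_nearest(toy_synthetic, toy_real, k=3)
```

The returned pairs correspond to the distances printed above the real images in Figure 5, where lower values indicate higher similarity.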


6. Discussion and Conclusion 


In this work, we make the first effort to introduce the emerging dataset condensation techniques to the privacy community and provide a systematic audit, including a theoretical analysis of the privacy property and an empirical evaluation, i.e., visual privacy examination and robustness against loss-based MIA and LiRA on the FashionMNIST, CIFAR-10 and CelebA datasets.


Our future work will attempt to generalize the theoretical findings to other DC methods. This can be studied from the perspective of information loss (e.g., data compression ratio). Moreover, DC methods that satisfy a formal DP formulation, e.g., $(\alpha, \epsilon)$-Rényi DP (Mironov, 2017), are worth exploring.

Figure 5. Examples of facial images that are most similar to synthetic data generated by DM with random initialization. The value above each image is the distance ($L_2$ and LPIPS) between the image and the synthetic data (first row). Lower distance indicates higher similarity. Even though these real images have similar face contours, blurred facial details (e.g., eyes, nose) make it difficult for the adversary to infer the membership.


The current efforts on DC mainly focus on image classification; thus, another interesting direction is the extension of the privacy benefit brought by DC to more complicated vision tasks (e.g., object detection) and non-vision tasks (e.g., text and graph related applications). In essence, DC methods can generalize to other machine learning tasks, as they learn the synthetic data by summarizing the distribution or discriminative information of the real training data. Hence, their privacy advantage should also generalize to other tasks.

A. Proof of Proposition 4.3


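We begin our analysis of DM with the linear extractor $\psi_\theta : \mathbb{R}^d \to \mathbb{R}^k$ such that $k < d$, $\theta = [\theta_{i,j}] \in \mathbb{R}^{k \times d}$ and, for an input $x$, $\psi_\theta(x) = \theta x$. We also omit the differentiable Siamese augmentation to simplify the analysis. As the representation extractors $\psi_\theta$ in DM are randomly initialized, we assume that the extractor parameters follow the standard normal distribution and are independent and identically distributed (i.i.d.), i.e., $\theta_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Thus, Equation (6) becomes the expectation over $\|d_{\theta,\mathrm{DM}}\|^2$, where $d_{\theta,\mathrm{DM}}$ is defined as:

$$ d_{\theta,\mathrm{DM}} = \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} \theta x_i - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} \theta s_j. \tag{19} $$

Hence, $L_{\mathrm{DM}} = \mathbb{E}_{\theta \sim \mathcal{N}(0,1)} \|d_{\theta,\mathrm{DM}}\|^2$. The optimization of $\mathcal{S}$ with SGD relies on the gradient of (6). Given a sampled model parameter $\theta$, for some synthetic sample $s_i$, we have:

$$ \frac{\partial \|d_{\theta,\mathrm{DM}}\|^2}{\partial s_i} = -\frac{2}{|\mathcal{S}|}\,(d_{\theta,\mathrm{DM}})^\top \theta = -\frac{2}{|\mathcal{S}|}\Big(\frac{1}{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{T}|} x_j - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} s_j\Big)^\top \theta^\top \theta. \tag{20} $$

Hence, we obtain:

$$ \frac{\partial L_{\mathrm{DM}}}{\partial s_i} = \frac{\partial\, \mathbb{E}_\theta \|d_{\theta,\mathrm{DM}}\|^2}{\partial s_i} = \mathbb{E}_\theta\Big[\frac{\partial \|d_{\theta,\mathrm{DM}}\|^2}{\partial s_i}\Big] = -\frac{2k}{|\mathcal{S}|}\Big(\frac{1}{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{T}|} x_j - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} s_j\Big)^\top, \tag{21} $$

where $\mathbb{E}_\theta[\theta^\top \theta] = k I_d$ by definition of $\theta$, and $I_d$ is the identity matrix of $\mathbb{R}^d$. Equation (21) indicates that the optimization direction of a synthetic sample $s_i$ is the direction that moves the barycenter of $\mathcal{S}$ to the barycenter of $\mathcal{T}$. Conceptually, the optimization of (6) moves the initialized $\mathcal{S}$ until its barycenter coincides with that of $\mathcal{T}$, because of the existence of a minimizer (Assumption 4.2) at which the left-hand side of (21) should be 0.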
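The behaviour described by (21) can be checked numerically with a small sketch that runs gradient descent using the expected gradient, i.e., every synthetic sample receives the same update proportional to the gap between the two barycenters. The function names, step size, and toy data are assumptions, and the stochastic sampling of $\theta$ is replaced by its expectation $\mathbb{E}_\theta[\theta^\top\theta] = k I_d$.

```python
from typing import List

Vector = List[float]


def barycenter(points: List[Vector]) -> Vector:
    """Coordinate-wise mean of a list of vectors."""
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(len(points[0]))]


def dm_expected_descent(real: List[Vector], synth: List[Vector],
                        k: int = 1, lr: float = 0.1, steps: int = 200) -> List[Vector]:
    """Gradient descent on L_DM with the expected gradient of Equation (21)."""
    synth = [list(s) for s in synth]
    m = len(synth)
    for _ in range(steps):
        b_real, b_synth = barycenter(real), barycenter(synth)
        # Expected gradient: -(2k/|S|) * (barycenter(T) - barycenter(S)), identical for all s_i.
        grad = [-(2.0 * k / m) * (t - s) for t, s in zip(b_real, b_synth)]
        synth = [[v - lr * g for v, g in zip(s, grad)] for s in synth]
    return synth


# Toy data: real data initialization with the first two real points.
real_points = [[0.0, 0.0], [2.0, 0.0], [4.0, 6.0], [2.0, 2.0]]
init_synth = [list(p) for p in real_points[:2]]
final_synth = dm_expected_descent(real_points, init_synth)
```

Because every $s_i$ receives the same update, each final sample equals its initial value plus the initial barycenter gap, which is the structure used in the proof of Proposition 4.4.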


B. Proof of Proposition 4.4 


Case of real data initialization. Suppose that each $s_i \in \mathcal{S}$ is sampled from $\mathcal{T}$; for simplicity, we consider $s_i = x_i$ as the initialization. According to (21), all $s_i$ are optimized until the barycenters of $\mathcal{S}$ and $\mathcal{T}$ coincide. Observe that each $s_i \in \mathrm{span}(\mathcal{T})$, because the projection components on $\mathrm{span}(\mathcal{T})^{\perp}$ remain zero throughout the optimization process of DM. Thus, one solution for the minimizer elements $s_i^* \in \mathcal{S}^*$ with real data initialization is:

$$ s_i^* = x_i + \frac{1}{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{T}|} x_j - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} x_j. \tag{22} $$

When $|\mathcal{S}|$ and $|\mathcal{T}|$ are large (e.g., > 50), we can consider $s_i^* \approx x_i$; thus, initialization with real data in DM still risks membership privacy leakage.


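Case of random initialization. The synthetic data are initialized as vectors drawn from a multivariate normal distribution, i.e.,

$$ \forall s_i \in \mathcal{S}, \quad s_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d). \tag{23} $$

Each synthetic sample $s_i$ can be written as a vector $[s_{i,1}, \cdots, s_{i,d}]$ under the basis $\mathcal{E}$ where $\forall j,\ s_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, because the covariance matrix of $s_i$ remains the identity matrix under any orthogonal transformation (i.e., orthogonal basis). Thus, we can decompose $d_{\theta,\mathrm{DM}}$ into the projections on the subspace $\mathrm{span}(\mathcal{T})$ and its orthogonal complement. Formally, we have

$$ \|d_{\theta,\mathrm{DM}}\|^2 = \big\|\theta_{1:d_{\mathcal{T}}}\mathrm{Proj}_{\mathcal{E}_{\mathcal{T}}}(\Delta_{\mathcal{S},\mathcal{T}})\big\|^2 + \underbrace{\big\|\theta_{d_{\mathcal{T}}:d}\,\mathrm{Proj}_{\mathcal{E}_{\perp}}(\Delta_{\mathcal{S},\mathcal{T}})\big\|^2}_{\|d_{\theta,\mathrm{DM}}\|^2_{\mathcal{E}_{\perp}}} + 2\big\langle \theta_{1:d_{\mathcal{T}}}\mathrm{Proj}_{\mathcal{E}_{\mathcal{T}}}(\Delta_{\mathcal{S},\mathcal{T}}),\ \theta_{d_{\mathcal{T}}:d}\,\mathrm{Proj}_{\mathcal{E}_{\perp}}(\Delta_{\mathcal{S},\mathcal{T}}) \big\rangle, \tag{24} $$

where $\theta_{a:b}$ represents the submatrix composed of the $a$-th to the $b$-th columns, $\mathrm{Proj}_{V}$ is the projection operator onto the subspace $V$ and

$$ \Delta_{\mathcal{S},\mathcal{T}} = \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} x_i - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} s_j. \tag{25} $$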
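Note that

$$ \mathrm{Proj}_{\mathcal{E}_{\perp}}(\Delta_{\mathcal{S},\mathcal{T}}) = -\frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} \mathrm{Proj}_{\mathcal{E}_{\perp}}(s_j), \tag{26} $$

because $x_i \in \mathrm{span}(\mathcal{T}) = \mathrm{span}(\mathcal{E}_{\mathcal{T}})$ for each $x_i$. Let $s_{i,\mathcal{E}_{\perp}} = \mathrm{Proj}_{\mathcal{E}_{\perp}}(s_i)$, then we have

$$ \mathbb{E}_\theta\Big[\frac{\partial \|d_{\theta,\mathrm{DM}}\|^2}{\partial s_{i,\mathcal{E}_{\perp}}}\Big] = \mathbb{E}_\theta\Big[\frac{\partial \|d_{\theta,\mathrm{DM}}\|^2_{\mathcal{E}_{\perp}}}{\partial s_{i,\mathcal{E}_{\perp}}}\Big] = \frac{2}{|\mathcal{S}|^2}\Big(\sum_{j=1}^{|\mathcal{S}|} s_{j,\mathcal{E}_{\perp}}\Big)^\top \mathbb{E}_\theta\big[(\theta_{d_{\mathcal{T}}:d})^\top \theta_{d_{\mathcal{T}}:d}\big], \tag{27} $$

because $\mathbb{E}_\theta\big[(\theta_{1:d_{\mathcal{T}}})^\top \theta_{d_{\mathcal{T}}:d}\big] = 0$. Therefore, the expectation of the above equation is the optimization direction of the projection of $s_i$ on the subspace $\mathrm{span}(\mathcal{T})^{\perp}$:

$$ \frac{\partial L_{\mathrm{DM}}}{\partial s_{i,\mathcal{E}_{\perp}}} = \mathbb{E}_\theta\Big[\frac{\partial \|d_{\theta,\mathrm{DM}}\|^2_{\mathcal{E}_{\perp}}}{\partial s_{i,\mathcal{E}_{\perp}}}\Big] = \frac{2\,\mathbb{E}\big[(\theta_{d_{\mathcal{T}}:d})^\top \theta_{d_{\mathcal{T}}:d}\big]}{|\mathcal{S}|^2}\Big(\sum_{j=1}^{|\mathcal{S}|} s_{j,\mathcal{E}_{\perp}}\Big)^\top. \tag{28} $$

Note that $\mathbb{E}\big[(\theta_{d_{\mathcal{T}}:d})^\top \theta_{d_{\mathcal{T}}:d}\big] = k I_{d - d_{\mathcal{T}}}$, thus the optimization direction is aligned with the barycenter of all $s_{i,\mathcal{E}_{\perp}}$ and will converge to 0 when

$$ \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} s_{i,\mathcal{E}_{\perp}} = 0_{\mathcal{E}_{\perp}}. \tag{29} $$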
=, 


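Since the initialization of $\mathcal{S}$ is essentially noise drawn from the standard normal distribution, the empirical average of $s_{i,\mathcal{E}_{\perp}}$ is close to 0 (by the law of large numbers); thus, we can consider that the projection component of the minimizer $s^*_{i,\mathcal{E}_{\perp}}$ is close to the initialized value, i.e.,

$$ \forall s_i^* \in \mathcal{S}^*, \quad s^*_{i,\mathcal{E}_{\perp}} \approx s_{i,\mathcal{E}_{\perp}}. \tag{30} $$

Similar to the case of real data initialization, the projection components of $s_i$ on $\mathrm{span}(\mathcal{T})$ are optimized to verify the first property of Proposition 4.3, i.e., the projection component of the $i$-th minimizer $s^*_{i,\mathcal{E}_{\mathcal{T}}}$ becomes

$$ s^*_{i,\mathcal{E}_{\mathcal{T}}} = s_{i,\mathcal{E}_{\mathcal{T}}} + \frac{1}{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{T}|} x_j - \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} s_{j,\mathcal{E}_{\mathcal{T}}}. \tag{31} $$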

B.1. Empirical verification 


We empirically verify our conclusions for random and real data initializations in Figure 6. The images are synthesized from CIFAR-10 by DM using a linear extractor of embedding dimension 2048, and each line contains images from the same class. On the right side, we plot the images synthesized with random initialization and real data initialization. We can observe that images synthesized with random initialization resemble a combination of noise and class-dependent background, which verifies our conclusion for random initialization: synthetic data with random initialization are composed of the barycenter of the original data in $\mathrm{span}(\mathcal{T})$ and the initialized noise in $\mathrm{span}(\mathcal{T})^{\perp}$ (see (13)). Note that even in this case, models trained on synthetic data can still achieve validation accuracy around 27%.

On the other hand, real data initialization produces only small changes to the images used for initialization, which verifies the conclusion for real data initialization: synthetic data with real data initialization are composed of the images used for initialization and the barycenter distance vector (see (12)).


Besides the linear extractor, we also investigate the impact of the activation function. On the left of Figure 6, we show images synthesized by DM using a ReLU-activated extractor (ReLU on top of the linear extractor). We can see that the presence of ReLU results in better convergence of DC and thus better image quality for both random and real data initialization. A potential reason is that ReLU changes the DC optimization and can lead to local minima different from the one found with the linear extractor. That is, data synthesized in this case are possibly composed of the barycenter of a certain group of similar images (e.g., images of a brown horse heading towards the right with a grass background) within the same class and an orthogonal noise vector. For example, CIFAR-10 synthetic images of class "horse" (third last line of the leftmost figure in Figure 6) are noisy but contain different backgrounds, which should be the barycenters of different image groups of class "horse": there are numerous horse images in CIFAR-10 where the horse heads towards the right or the left. This observation is also consistent with ReLU improving the generalization of neural networks. Appendix D contains a more detailed analysis for the non-linear extractor.

Figure 6. Synthetic images with random noise/real data initialization and with linear/ReLU-activated extractor on CIFAR-10. Panels, from left to right: ReLU-activated extractor (random initialization, real data initialization) and linear extractor (random initialization, real data initialization).


C. Proof of Proposition 4.10 


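We aim to quantify the membership privacy leakage of a member $x$ with the Kullback-Leibler (KL) divergence of model parameter distributions. Without loss of generality, we study how the last element $x_{|\mathcal{T}|}$ influences the model parameter distribution. Let $\mathcal{T}'$ denote $\mathcal{T} \setminus \{x_{|\mathcal{T}|}\}$, where $\mathcal{T} = \{x_1, \cdots, x_{|\mathcal{T}|}\}$. The synthetic datasets obtained by DM from $\mathcal{T}$ and $\mathcal{T}'$ are denoted $\mathcal{S}$ and $\mathcal{S}'$, respectively, with $|\mathcal{S}| = |\mathcal{S}'|$. In addition, we denote $p(\theta) = P(\theta|\mathcal{S})$ and $q(\theta) = P(\theta|\mathcal{S}')$. The KL divergence between $p$ and $q$ is

$$ D_{KL}(p\|q) = \int_\theta p(\theta)\ln\frac{p(\theta)}{q(\theta)}\,d\theta = \int_\theta \frac{1}{K_{\mathcal{S}}}\exp\Big(-\sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i)\Big)\ln\frac{p(\theta)}{q(\theta)}\,d\theta, \tag{32} $$

where

$$ \begin{aligned} \ln\frac{p(\theta)}{q(\theta)} &= \sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i') - \sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i) + K_{\mathcal{S}'} - K_{\mathcal{S}} \\ &= \sum_{i=1}^{|\mathcal{S}|}\big(l(\theta, s_i') - l(\theta, s_i)\big) + K_{\mathcal{S}'} - K_{\mathcal{S}} \\ &\le L\sum_{i=1}^{|\mathcal{S}|}\|s_i' - s_i\|_2 + |K_{\mathcal{S}'} - K_{\mathcal{S}}|. \end{aligned} \tag{33} $$

According to Assumption 4.8, $K_{\mathcal{S}}$ (and similarly $K_{\mathcal{S}'}$) is:

$$ K_{\mathcal{S}} = \int_\theta \exp\Big(-\sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i)\Big)\,d\theta. \tag{34} $$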


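Since $x_{|\mathcal{T}|}$ is not used for real data initialization, according to Proposition 4.4, if $\mathcal{S}$ and $\mathcal{S}'$ share the same initialization type and initialized values, we have for each $i$

$$ \begin{aligned} \|s_i' - s_i\|_2 &= \Big\|\frac{1}{|\mathcal{T}|-1}\sum_{j=1}^{|\mathcal{T}|-1} x_j - \frac{1}{|\mathcal{T}|}\sum_{j=1}^{|\mathcal{T}|} x_j\Big\|_2 \\ &= \frac{1}{|\mathcal{T}|}\Big\|\frac{1}{|\mathcal{T}|-1}\sum_{j=1}^{|\mathcal{T}|-1} x_j - x_{|\mathcal{T}|}\Big\|_2 \\ &\le \frac{2B}{|\mathcal{T}|}. \end{aligned} \tag{35} $$

The second term on the right side of (33) can be processed similarly:

$$ \begin{aligned} |K_{\mathcal{S}'} - K_{\mathcal{S}}| &= \Big|\int_\theta \exp\Big(-\sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i')\Big) - \exp\Big(-\sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i)\Big)\,d\theta\Big| \\ &= \Big|\int_\theta \Big[\exp\Big(\sum_{i=1}^{|\mathcal{S}|}\big(l(\theta, s_i) - l(\theta, s_i')\big)\Big) - 1\Big]\exp\Big(-\sum_{i=1}^{|\mathcal{S}|} l(\theta, s_i)\Big)\,d\theta\Big|. \end{aligned} \tag{37} $$

From the previous analysis, we know that

$$ \sum_{i=1}^{|\mathcal{S}|}\big(l(\theta, s_i) - l(\theta, s_i')\big) \le \frac{2LB|\mathcal{S}|}{|\mathcal{T}|}. \tag{38} $$

Since $\exp(x) - 1 = O(x)$ in the neighborhood of 0, we have

$$ |K_{\mathcal{S}'} - K_{\mathcal{S}}| = O\Big(\frac{2LB|\mathcal{S}|K_{\mathcal{S}}}{|\mathcal{T}|}\Big) = O\Big(\frac{|\mathcal{S}|}{|\mathcal{T}|}\Big). \tag{39} $$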


Note that $K_{\mathcal{S}}$ should decrease as $|\mathcal{S}|$ increases, because an additional synthetic sample $s$ introduces a factor $\exp(-l(\theta, s)) < 1$ in the integral. We omit this effect here and assume that $K_{\mathcal{S}}$ varies little when $|\mathcal{S}|$ changes. Together with (32) and (33), we obtain the privacy bound by KL divergence:

$$ D_{KL}(p\|q) = O\Big(\frac{|\mathcal{S}|}{|\mathcal{T}|}\Big). \tag{40} $$
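The key step of the bound is the barycenter shift in (35): removing one sample from $\mathcal{T}$ moves the barycenter by exactly $\frac{1}{|\mathcal{T}|}$ times the distance between the removed sample and the barycenter of the remaining ones, which is at most $2B/|\mathcal{T}|$ when every sample has norm at most $B$. The sketch below checks both the identity and the bound on random toy data. The helper names and the data are assumptions for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def mean_vec(points: List[Vector]) -> Vector:
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(len(points[0]))]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def barycenter_shift(points: List[Vector]) -> float:
    """|| mean(T') - mean(T) ||_2, where T' drops the last point of T."""
    return norm(sub(mean_vec(points[:-1]), mean_vec(points)))


def closed_form_shift(points: List[Vector]) -> float:
    """(1/|T|) * || mean(T') - x_last ||_2, the second line of Equation (35)."""
    return norm(sub(mean_vec(points[:-1]), points[-1])) / len(points)


# Toy data: 100 random points in dimension 5, rescaled to have norm at most B.
B = 1.0
rng = random.Random(0)
toy_points = []
for _ in range(100):
    v = [rng.gauss(0.0, 1.0) for _ in range(5)]
    scale = B * rng.random() / norm(v)
    toy_points.append([scale * x for x in v])
```

Multiplying this per-sample shift by the Lipschitz constant $L$ and summing over the $|\mathcal{S}|$ synthetic samples gives the $2LB|\mathcal{S}|/|\mathcal{T}|$ term in (38).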


D. Generalization to Non-Linear Extractor 


We consider a 2-layer network as the extractor, i.e., a linear extractor with ReLU activation, and show that the (pseudo-)barycenters of $\mathcal{S}$ and $\mathcal{T}$ also coincide, as claimed by Proposition 4.3. We then empirically validate the conclusion by plotting the $L_2$ distance between the barycenters of $\mathcal{S}$ and $\mathcal{T}$ during the condensation by DM for different $r_{ipc}$ on CIFAR-10 (see Figure 7).


D.1. Analysis for 2-layer Network as Extractor 


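With the activation function ReLU (denoted $\rho$), Equation (19) becomes:

$$ d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}} = \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|}\rho(\theta \cdot x_i) - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|}\rho(\theta \cdot s_i). \tag{41} $$

Since $\theta_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ for each element $\theta_{i,j}$ of $\theta$, for an input $x = [x_j]_{1 \le j \le d} \in \mathbb{R}^d$, we have

$$ y = \theta \cdot x = \Big[\sum_{j=1}^{d}\theta_{i,j}x_j\Big]_{1 \le i \le k} = [y_i]_{1 \le i \le k} \in \mathbb{R}^k, \tag{42} $$

where $y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\big(0, \sum_{j=1}^{d} x_j^2\big)$. Since $\rho(x) := \max(0, x)$, we have $\rho(y) = [\max(0, y_i)]_{1 \le i \le k}$. Define $Y = \max(0, X)$ where the random variable $X \sim \mathcal{N}(0, \sigma^2)$. Then, $Y$ follows the same distribution as $B|X|$, where $B \sim \mathrm{Bernoulli}(\frac{1}{2})$ is independent of $X$, and $\mathbb{E}_X[Y] = \mathbb{E}_B[B]\,\mathbb{E}_X[|X|]$. Therefore, for each $i$, $\max(0, y_i) = B_i|y_i| = B_i\big|\sum_{j=1}^{d}\theta_{i,j}x_j\big|$, and we can obtain

$$ \rho(y) = \rho(\theta x) = B \odot |\theta x| \tag{43} $$

where $\odot$ is element-wise multiplication, $B = [B_i]_{1 \le i \le k}$ and $B_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\frac{1}{2})$. With this in mind, Equation (41) becomes: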
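$$ d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}} = \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|}\rho(\theta x_i) - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|}\rho(\theta s_i) = \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} B_i^{\mathcal{T}} \odot |\theta x_i| - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} B_i^{\mathcal{S}} \odot |\theta s_i|, \tag{44} $$

where the vectors of Bernoulli random variables $B_i^{\mathcal{T}}$ and $B_i^{\mathcal{S}}$ for each data sample are independent. To simplify notation, we consider $k = 1$. The vector of Bernoulli random variables reduces to a single random variable $B_i$, and the bold symbol $\mathbf{d}$ becomes $d$. Moreover, let $\mathrm{sgn}(x)$ denote the sign of $x$, so that $|x| = \mathrm{sgn}(x)\,x$ for a real number $x$. Thus, with $k = 1$, we can reduce $d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}}$ to a form similar to (19):

$$ \begin{aligned} d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}} &= \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} B_i^{\mathcal{T}}\,\mathrm{sgn}(\theta x_i)\,\theta x_i - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} B_i^{\mathcal{S}}\,\mathrm{sgn}(\theta s_i)\,\theta s_i \\ &= \theta\Big(\frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} B_i^{\mathcal{T}}\,\mathrm{sgn}(\theta x_i)\,x_i - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} B_i^{\mathcal{S}}\,\mathrm{sgn}(\theta s_i)\,s_i\Big). \end{aligned} \tag{45} $$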
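Recall that for each $j$, $\partial L_{\mathrm{DM}}/\partial s_j = \mathbb{E}_\theta\big[\partial (d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}})^2/\partial s_j\big] = \mathbb{E}_\theta\big[2(\partial d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}}/\partial s_j)\,d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}}\big]$, and we have

$$ \frac{\partial d^{\mathrm{ReLU}}_{\theta,\mathrm{DM}}}{\partial s_j} = -\frac{1}{|\mathcal{S}|} B_j^{\mathcal{S}}\,\mathrm{sgn}(\theta s_j)\,\theta. \tag{46} $$

Thus, the gradient of $L_{\mathrm{DM}}$ on $s_j$ becomes:

$$ \begin{aligned} \frac{\partial L_{\mathrm{DM}}}{\partial s_j} &= \mathbb{E}_{\theta,B}\Big[-\frac{2}{|\mathcal{S}|}\theta^\top\theta\Big(\frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} B_j^{\mathcal{S}}\mathrm{sgn}(\theta s_j)B_i^{\mathcal{T}}\mathrm{sgn}(\theta x_i)\,x_i - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} B_j^{\mathcal{S}}\mathrm{sgn}(\theta s_j)B_i^{\mathcal{S}}\mathrm{sgn}(\theta s_i)\,s_i\Big)\Big] \\ &= -\frac{2}{|\mathcal{S}|}\Big(\frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|}\mathbb{E}_B[B_j^{\mathcal{S}}B_i^{\mathcal{T}}]\,\mathbb{E}_\theta[\mathrm{sgn}(\theta s_j)\mathrm{sgn}(\theta x_i)\theta^\top\theta]\,x_i - \frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|}\mathbb{E}_B[B_j^{\mathcal{S}}B_i^{\mathcal{S}}]\,\mathbb{E}_\theta[\mathrm{sgn}(\theta s_j)\mathrm{sgn}(\theta s_i)\theta^\top\theta]\,s_i\Big). \end{aligned} \tag{47} $$

Let $M(x, y)$ denote $\mathbb{E}_\theta[\mathrm{sgn}(\theta x)\,\mathrm{sgn}(\theta y)\,\theta^\top\theta] \in \mathbb{R}^{d \times d}$, then the above equation becomes:

$$ \frac{\partial L_{\mathrm{DM}}}{\partial s_j} = -\frac{2}{|\mathcal{S}|}\Big(\frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|}\frac{1}{4}M(s_j, x_i)\,x_i - \frac{1}{|\mathcal{S}|}\sum_{i \ne j}\frac{1}{4}M(s_j, s_i)\,s_i - \frac{1}{2|\mathcal{S}|}\,s_j\Big), \tag{48} $$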

because $M(x, y) = I_d$ if $x = y$. Note that if $x = -y$, then $M(x, y) = -I_d$. In fact, we can prove that

$$ x^\top M(x, y)\,y = \mathbb{E}_\theta\big[(\mathrm{sgn}(\theta x)\,\theta x)^\top(\mathrm{sgn}(\theta y)\,\theta y)\big] = \frac{\|x\|_2\|y\|_2}{\pi}\big((\pi - 2\phi)\cos(\phi) + 2\sin(\phi)\big), \tag{49} $$

so $M(x, y)$ can be seen as a matrix depending on the angle $\phi$ between $x$ and $y$. Even though each original data point $x_i$ is transformed by $M(s_j, x_i)$, their average can still be seen as a pseudo-barycenter, and the above equation signifies that each $s_j$ is updated towards minimizing the distance between the pseudo-barycenters of $\mathcal{T}$ and $\mathcal{S}$, which verifies the first property of Proposition 4.3 on a non-linear extractor. This further validates the privacy property of DM, which is based on the connection between $\mathcal{S}$ and $\mathcal{T}$.

Figure 7. Distance ($\|\cdot\|_2$) between the barycenters of $\mathcal{S}$ and $\mathcal{T}$ decreases with the iteration round, which verifies the first property of Proposition 4.3. The solid and dashed lines represent real data and random initialization, respectively. The blue and orange lines represent the cases where $r_{ipc}$ equals 0.01 and 0.02, respectively.
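For $k = 1$, the quantity in (49) is $\mathbb{E}_\theta[|\theta x|\,|\theta y|]$ with $\theta \sim \mathcal{N}(0, I_d)$, so the closed form can be compared against a Monte Carlo estimate. The sketch below does this for two unit vectors at an angle of $\pi/3$. The function names, sample count and test vectors are assumptions chosen for illustration.

```python
import math
import random
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def closed_form(x: Sequence[float], y: Sequence[float]) -> float:
    """Right-hand side of Equation (49) for the angle phi between x and y."""
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    cos_phi = max(-1.0, min(1.0, dot(x, y) / (nx * ny)))
    phi = math.acos(cos_phi)
    return nx * ny / math.pi * ((math.pi - 2.0 * phi) * math.cos(phi) + 2.0 * math.sin(phi))


def monte_carlo(x: Sequence[float], y: Sequence[float],
                num_samples: int = 100_000, seed: int = 0) -> float:
    """Estimate E_theta[|theta x| * |theta y|] with theta ~ N(0, I_d) and k = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = [rng.gauss(0.0, 1.0) for _ in x]
        total += abs(dot(theta, x)) * abs(dot(theta, y))
    return total / num_samples


x_vec = [1.0, 0.0, 0.0]
y_vec = [0.5, math.sqrt(3.0) / 2.0, 0.0]   # angle pi/3 with x_vec
estimate = monte_carlo(x_vec, y_vec)
exact = closed_form(x_vec, y_vec)
```

The two limiting cases $\phi = 0$ and $\phi = \pi$ both give $\|x\|_2^2$, matching $M(x, x) = I_d$ and $M(x, -x) = -I_d$.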


Next, we empirically verify the Proposition 4.3 with tests on CIFAR- 10. 


D.2. Empirical verification of Proposition 4.3 for non-linear extractor 


Figure 7 shows the distance $\big\|\frac{1}{|\mathcal{S}|}\sum_{i=1}^{|\mathcal{S}|} s_i - \frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|} x_i\big\|_2$ for each DM iteration on CIFAR-10. We can observe that the barycenter distance decreases with the iteration round and reaches its minimum. Note that the right subfigure of Figure 7 shows that the barycenters of $\mathcal{T}$ and $\mathcal{S}$ synthesized on the 2-layer network (i.e., a linear model activated by ReLU) have a distance around 0, validating the theoretical analysis above. As for the convolutional network (ConvNet), the distance decreases more slowly than for the 2-layer network. We suspect that the convolutional layers lead the optimization to a local minimum. Figure 7 also validates the impact of $r_{ipc}$ and initialization on the distance between the barycenters of $\mathcal{S}$ and $\mathcal{T}$: 1) when the iteration round is around 0, the distance for 100 images per class is smaller than that for 50 images per class; 2) real initialization has a much lower distance than random initialization at the beginning of DM optimization.


E. Additional Experimental Details and Results 


All experiments are conducted with PyTorch 1.10 on an Ubuntu 20.04 server.


E.1. Details of Hyperparameters and Settings. 


DC Settings. We reproduced DM (Zhao & Bilen, 2021a) and adopted large learning rates to accelerate the condensation (i.e., 10, 50 and 100 as the learning rate for $r_{ipc}$ = 0.002, 0.01 and 0.02, respectively). For DSA (Zhao & Bilen, 2021b), we adopt the default setting¹ for all datasets. For KIP, we reproduced the method in PyTorch according to the official code of (Nguyen et al., 2021a), and set the learning rate to 0.04 and 0.1 for $r_{ipc}$ = 0.002 and 0.01, respectively. Note that we omit $r_{ipc}$ = 0.02 for KIP and DSA due to their low efficiency. We also apply differentiable Siamese augmentations (Zhao & Bilen, 2021b) for both DM and KIP.


E.2. Loss distribution of data used for DC initialization and test data on fs 


In Figure 8, we show the distribution of $f_{\mathcal{S}}$ losses evaluated on data used for DC initialization (Real Init) and data not used for initialization (Other). We can observe that the losses of data used for initialization are smaller than those of the other data, showing that the membership of data used for DC initialization is easier to infer. The distribution difference also explains the high advantage scores in Table 1.


¹ https://github.com/VICO-UoE/DatasetCondensation


Privacy for Free: How does Dataset Condensation Help Privacy? 


[Figure 8: grid of loss histograms. Columns: FashionMNIST, CIFAR-10, CelebA. x-axis: Loss (log-scale for CelebA). Legend: Real Init, Other.]

Figure 8. Losses of data used for DC initialization (Real Init) are smaller than those of data not used for initialization (Other).


E.3. Visualization of DC-synthesized data distribution 


Figure 9 shows the t-SNE visualization of CIFAR-10 and CelebA data synthesized by GAN and DC methods (DSA, DM and KIP without ZCA preprocessing). We clip the DC-synthesized data to the range [0, 1] for a fair comparison with GAN-synthesized data. Note that the generated data distributions of DM and DSA are more similar than those of KIP and GAN, explaining why DM-synthesized and DSA-synthesized data enable models to achieve higher accuracy under the same $r_{ipc}$.


E.4. MIA against cGANs 


Our threat model assumes that the adversary has white-box access to the synthetic dataset. We apply the MIA against GANs proposed by Chen et al. (called GAN-leak). The main intuition is that member data are easier to reconstruct with the GAN generator $G$, so the MIA is based on the (calibrated) reconstruction loss $L_{cal}$:

$$ M(x) = \mathbb{1}\big(L_{cal}(x, G(z)) < \tau\big). \tag{50} $$

The adversary optimizes $L_{cal}$ by varying $z$ to estimate whether $x$ belongs to the training dataset. According to the adversary's knowledge, the attack can be divided into black-box, partial black-box and white-box attacks. We conducted the white-box attack for scenarios where the adversary has access to the generators. The results on CelebA are in Table 4, indicating that a vanilla GAN can be used to infer the membership of training data. Chen et al. also validated that the partial black-box attack can achieve attack performance similar to the white-box attack, because the adversary has access to $z$ and can leverage non-differentiable optimization, e.g., Powell's Conjugate Direction Method (Powell, 1964), to approximately minimize $L_{cal}$.

Figure 9. Distribution visualization of CIFAR-10 (left) and CelebA (right) synthesized by GAN, DSA, DM and KIP (without ZCA preprocessing).

Figure 10. Accuracy of models trained on data synthesized by different DC methods and on data generated by baselines for $r_{ipc} = 0.002$.


Table 4. Results of GAN-leak attack against cGANs averaged over 10 shadow models.

Dataset    ROC AUC         Advantage (%)
CelebA     56.06 ± 2.03    22.98 ± 4.27


E.5. Comparison of accuracy for models trained on synthetic dataset for ripe = 0.002 


Figure 10 presents the accuracy comparison of models trained on data synthesized by DC and baseline methods for $r_{ipc} = 0.002$. We can see that KIP significantly outperforms the baselines and achieves performance similar to DSA and DM on CIFAR-10. Moreover, we can observe that ZCA preprocessing is effective in improving the utility of the KIP-synthesized dataset.


